Features and dimensions: Motion estimation in fly vision 
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We characterize the computation of motion in the fly visual system as a mapping from the high 
dimensional space of signals in the retinal photodetector array to the probability of generating an 
action potential in a motion sensitive neuron. Our approach to this problem identifies a low dimen- 
sional subspace of signals within which the neuron is most sensitive, and then samples this subspace 
to visualize the nonlinear structure of the mapping. The results illustrate the computational strate- 
gies predicted for a system that makes optimal motion estimates given the physical noise sources 
in the detector array. More generally, the hypothesis that neurons are sensitive to low dimensional 
subspaces of their inputs formalizes the intuitive notion of feature selectivity and suggests a strategy 
for characterizing the neural processing of complex, naturalistic sensory inputs. 



I. INTRODUCTION 

Vision begins with the counting of photons by a 
large array of detector elements in the retina. From 
these inputs, the brain is thought to extract features, 
such as the edges in an image or the velocity of motion 
across the visual field, out of which our perceptions are 
constructed. In some cases we can point to individual 
neurons in the brain that represent the output of this 
feature extraction; classic examples include the center- 
surround comparison encoded by retinal ganglion cells 
PJ , the orientation selective "edge detectors" described 
by Hubel and Wiesel in primary visual cortex 0] , and 
the direction selective, motion sensitive neurons found 
in visual systems from flies to rabbits to primates 
As emphasized by Marr, feature extraction seems 
such a natural and elementary step in sensory signal 
processing that it is easy to overlook the challenges 
posed by these computations @ . 

Our focus in this paper is on the computations lead- 
ing to the extraction of motion across the visual field, 
although we believe that the key issues are common 
to many different problems in neural computation. In 
primates the spike trains of motion sensitive neurons 
are correlated with perceptual decision making on a 
trial-by-trial basis 0, lesions to populations of these 
neurons produce specific behavioral deficits Q, and 
stimulation of the population can bias both percep- 
tual decisions and more graded visuomotor behav- 
iors 10] . In insects, ablation of single motion sen- 
sitive neurons leads to deficits of visuomotor behav- 
ior that match the spatial and directional tuning of 
the individual neurons 0, and one can use the spike 
sequences from these cells to estimate the trajectory 
of motion across the visual field or to distininguish 
among subtly different trajectories 01 • Strikingly, at 
least under some conditions the precision of these mo- 
tion estimates approaches the physical limits set by 



diffraction and photon shot noise at the visual input 
|T3IT3.IT^ |. Taken together these observations suggest 
strongly that the motion sensitive neurons represent 
most of what the organism knows about visual mo- 
tion, and in some cases everything that the organism 
could know given the physical signals and noise in the 
retina. 

It is tempting to think that the stimulus for a motion 
sensitive neuron is the velocity of motion across the 
visual field, but this is wrong: the input to all visual 
computation is a representation of the spatiotemporal 
history of light intensity falling on the retina, I(x.,t). 
This representation is approximate, first because the 
physical carrier, a photon stream, is inherently noisy, 
and second because the intensity pattern is blurred by 
the optics, and sampled in a discrete raster. Features, 
such as velocity, must be computed explicitly from 
this raw input stream. As discussed below, even the 
simplest visual computations have access to D ~ 10 2 
spacetime samples of I(x,t). If the response of a sin- 
gle neuron were an arbitrary function on a space of 
100 dimensions, then no reasonable experiment would 
be sufficient to characterize the computation that is 
represented by the neuron's output spike train. Any 
method for characterizing the mapping from the vi- 
sual input /(x, t) to an estimate of motion as encoded 
by a motion sensitive neuron must thus involve some 
simplifying assumptions. 

Models for the neural computation of motion go 
back (at least) to the classic work of Hassenstein and 
Reichardt, who proposed that insects compute motion 
by evaluating a spatiotemporal correlation of the sig- 
nals from the array of photodetector cells in the com- 
pound eye |l6[ . Essentially the same computational 
strategy is at the core of the motion energy models that 
are widely applied to the analysis of human perception 
and neural responses in primates 0] . Both the corre- 
lation model and the motion energy model have been 



extended in various ways to include saturation or nor- 
malization of the responses [Til Il9| . A seemingly very 
different approach emphasizes that motion is a rela- 
tionship between spatial and temporal variation in the 
image, and in the simplest case this means that veloc- 
ity should be recoverable as the ratio of temporal and 
spatial derivatives j2p|. Finally, the fact that the fly 
visual system achieves motion estimates with a preci- 
sion close to the physical limits [F_| motivates the 
theoretical question of which estimation strategies will 
in fact make best use of the available signals, and this 
leads to rather specific predictions about the form of 
the motion computation |2lj . The work described here 
has its origins in the attempt to test these predictions 
of optimal estimation theory. 

The traditional approach to testing theories of mo- 
tion estimation involves the design of particular visual 
stimuli which would highlight or contrast the predic- 
tions of particular models. This tradition is best devel- 
oped in the work of the Reichardt school, which aimed 
at testing and elaborating the correlation model for 
motion estimation in fly vision [22I |23| . The fact that 
simple mathematical models developed from elegant 
behavioral experiments in the early 1950s provided a 
basis for the design and analysis of experiments on the 
responses of single neurons decades later j 12 II should 
be viewed as one of the great triumphs of theoretical 
approaches to brain function. While the correlation 
model (or the related motion energy models) describes 
many aspects of the neural response, it probably is fair 
to say that the simplest versions of these models are 
not sufficient for describing neural responses to motion 
generally, and especially in more natural conditions. 

One of the clear predictions of optimal estimation 
theory is that computational strategies which make 
the best use of the available information in the retinal 
array must adapt to take account of different stimulus 
ensembles not only will the computation of mo- 
tion reach a different answer in (for example) a bright, 
high contrast environment and in a dim, low contrast 
environment, the optimal strategy actually involves 
computing a different function in each of these envi- 
ronments. Further, this adaptation in computational 
strategy is not like the familiar light and dark adapta- 
tion; instead it involves adjusting the brain's compu- 
tational strategy in relation to the whole distribution 
of visual inputs rather than just the mean. If statis- 
tical adaptation occurs, then the program of testing 
computational models by careful choice of stimuli has 
a major difficulty, namely that the system will adapt 
to our chosen stimulus ensemble and we may not be 
able to isolate different aspects of the computation as 
expected. 

There is evidence that the coding of dynamic sig- 
nals adapts to the input distribution both in the fly 
motion sensitive neurons 1251 1261 1271 and in the ver- 



tebrate retina |2jJ |29j, so that in these systems at 
least the representation of stimulus features depends 
on the context in which they are presented. In these 
examples context is defined by the probability distri- 
bution from which the signals are drawn, but there 
also is a large body of work demonstrating that neu- 
ral responses at many levels of the visual system are 
modulated by the (instantaneous) spatial context in 
which localized stimuli are presented [3(j- What is 
needed, then, is a method which allows us to ana- 
lyze the responses to more complex — and ultimately to 
fully natural — inputs and decompose these responses 
into the elementary computational steps. 

We will see that models of motion estimation share 
a common structure: estimation involves a projection 
of the high dimensional visual inputs onto a lower di- 
mensional space, followed by a nonlinear interaction 
among variables in this low dimensional space. The 
main goal of this paper, then, is to present an analysis 
method that allows us to observe directly the small 
number of dimensions which are relevant to the pro- 
cessing of complex, high dimensional inputs by any 
particular neuron. We use this approach to show that 
the motion sensitive neuron HI in the fly visual sys- 
tem computes a function of its inputs that is of the 
form predicted by optimal estimation theory. More 
generally, we believe that the reduction of dimension- 
ality may be the essential simplification required for 
progress on the subjects of neural coding and process- 
ing of complex, naturalistic sensory signals. 



II. MODELS OF MOTION ESTIMATION 

The classic model of visual motion detection is the 
Reichardt correlator, schematized in Fig. ^ In the 
simplest version of the model, the output signal is just 
the product of the voltages in neighboring photorecep- 
tors that have been passed through different filters, 
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This signal has a directionality, since the 11 th receptor 
voltage is passed through filter / while its left neigh- 
bor is passed through filter g. A better estimate of 
motion, however, would be genuinely antisymmetric 
rather than merely directional, and to achieve this we 
can subtract the signal computed with the opposite 
directionality: 
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Although it is natural to discuss visual computation 
using the photoreceptor signals as input, in fact we 
can't control these signals and so we should refer the 
computation back to image intensities or contrasts. If 
the photoreceptors give linear responses to contrast 
C (x, t) over some reasonable range, then we can write 
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where M(x) is the spatial transfer function or aper- 
ture of the receptor and T(t) is the temporal impulse 
response. Substituting into Eq. (J2J), we obtain 



9<$(t) = «i(t)« 4 (t) - s 2 (t)s 3 (t), 



(4) 



contrast pattern, C(x,t) 




imaging 




™ phototransduction 
v„ 

linear filtering 



X}* — — Kx) multiplication 
addition 



FIG. 1: The correlator model of visual motion detection, 
adapted from Ref. [TH . A spatiotemporal contrast pat- 
tern C(x,t) is blurred by the photoreceptor point spread 
function, M(x), and sampled by an array of photorecep- 
tors, two of which (neighboring photoreceptors numbers 
n — 1 and n) are shown here. After phototransduction, 
the signals in each photoreceptor are filtered by two differ- 
ent linear filters, f(t) and g(t). The outputs of these filters 
from the different photoreceptors, si (t) and S3 (i) from pho- 
toreceptor n and S2(t) and S4(t) from photoreceptor n — 1 
are multiplied and one of these products is subtracted from 
the other by the addition unit, yielding a direction selective 
response. 



where we are now careful to note that this is the es- 
timate obtained at point n on the retina, and each 
of the signals Si(i) is a linearly filtered version of the 
spatiotemporal contrast pattern, 
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with the temporal filters 

f(r) = J dT'f(r-r')T(r'), (9) 
g{r) = [ dr'g(r - r')T(r'). (10) 



Although much of the effort in applying the Reichardt 
(and related) models to experiment has focused on 
measuring the particular filters / and g, we want to 
emphasize here the fact the form of Eq. is simple 
no matter how complex the filters might be. 

In principle, the estimate of motion at time t can 
be influenced by the entire movie that we have seen 
up to this point in time, that is by the whole history 
C(x, t — r) with r > 0. In real systems it typically 
makes sense to use a finite time window, and in the 
concrete example of the fly's motion sensitive neurons, 
the relevant window for the computation can be on the 
order of 150 msec [U, while the absolute time reso- 
lution is at worst a few milliseconds 32]. This means 
that there are at least ~ 50 time points which can enter 
our description of this movie. To compute motion we 
need access to at least two independent spatial pixels, 
so altogether the history C(x, t — r) involves at least 
one hundred numbers: "the stimulus" is a point in a 
space of over 100 dimensions. Despite this complexity 
of the stimulus, Eq. @ tells us that — if this model 
of motion computation is correct — only four stimulus 
parameters are relevant. The computation of motion 
involves a nonlinear combination of these parameters, 
but the parameters themselves are just linear combi- 
nations of the ~ 10 2 stimulus variables. While our 
immediate concern is with the fly visual system, simi- 
lar dimensionalities arise if we think about the stimuli 
which drive neurons in the early stages of mammalian 
visual cortex, or the acoustic stimuli of relevance for 
early stages of auditory processing. 

More formally, we can think of the stimulus his- 
tory C(x, t — t) leading up to time f as a vector s t in 
D ~ 10 2 dimensions. If the motion sensitive neurons 
encode the output of the simplest Reichardt correlator 
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then the probability per unit time r(t) of generating 
an action potential will be of the form 
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where in this case K = 4 and the vectors Vj describe 
the spatiotemporal filters from Eq's. (JSHSJ. The cen- 
tral point is not that the function G has a simple 
form — it might not, especially when we consider the 
nonlinearities associated with spike generation itself — 
but rather that the number of relevant dimensions K 
is much less than the full stimulus dimensionality D. 

As described here, the correlation computation in- 
volves just two photoreceptor elements. In motion en- 
ergy models these individual detector elements are re- 
placed by potentially more complex spatial receptive 
fields [17|. so that M(x) is Eq. can have a richer 
structure than that determined by photoreceptor op- 
tics; more generally we can imagine that rather than 
two identical but displaced spatial receptive fields we 
just have two different fields. The general structure is 
the same, however: two spatial samples of the movie 
C(x, t) are passed through two different temporal fil- 
ters, and the resulting four variables are combined is 
some appropriate nonlinear fashion. Elaborated ver- 
sions of both the Reichardt and motion energy models 
might include six or eight projections, but the space 
of relevant stimulus variables always is much smaller 
than the hundreds of dimensions describing the input 
stimulus movie. 

Wide field motion sensitive neurons, such as the fly's 
HI cell which is the subject of the experiments below, 
are thought to sum the outputs of many elementary 
pairwise correlators to obtain an estimate of the global 
or rigid body rotational motion, 
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where N is the total number of photoreceptors and the 
local estimators 9^ at each point n along the retina 
are defined in Eq. we have written this as if all 
the photoreceptors are arrayed along a line, but there 
is a simple generalization to a fully two dimensional 
array. This computation takes as input not 2 x 50 
samples of the movie that we project onto the retina, 
but rather N x 50, which can reach D ~ 10 4 . In this 
case the essential reduction of dimensionality is from 
this enormous space to one of only (N/2) x 4 ~ 10 2 
dimensions. While dimensionality reduction still is a 
key to understanding the computation in this case, we 
would like to start with a more modest problem. In 
principle we can probe the response of this estimator 



by stimulating only two photoreceptors |33| . Alter- 
natively we can limit the dimensionality of the input 
by restricting stimuli to a single spatial frequency, or 
equivalently just two components which vary as sine 
and cosine of the visual angle x, 

I(x, t) = I • [1 + s(t) sm(kx) + c(t) cos(fcr)], (16) 

where I(x,t) is the light intensity with mean /, k/2n 
is the spatial frequency, and the dynamics of the stim- 
ulus is defined by the functions s(t) and c(t). The 
prediction of the correlator model Eq. (jT5|l is that 
the motion estimate is again determined by only four 
stimulus parameters, and in the limit that the cell in- 
tegrates over a large number of receptors we find the 
simple result 
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We emphasize that even with this simplification of the 
stimulus the known combination of temporal precision 
and integration times in motion computation mean 
that ~ 10 2 samples of the functions s(t) and c(t) could 
be relevant to the probability of spike generation in the 
motion sensitive neurons. 

Thus far we have discussed models of how the vi- 
sual system could estimate motion; one can also ask if 
there is a way that the system should estimate motion. 
In particular, for the problem of wide field motion es- 
timation faced by the fly, we can ask how to process 
the signals coming from the array of photodetectors 
so as to generate an estimate of velocity which is as 
accurate as possible given the constraints of noise and 
blur in the photoreceptor array. This is a well posed 
theoretical problem, and the results are as follows [2l| : 

Low SNR. In the limit of low signal-to-noise ratios 
(SNR), the optimal estimator is a generalization of the 
correlator model in Eq. I|15|) . 

0est(t) = [ dTd T 'V n (t ~ T)K nm (T,T')V m (t - t') 

nm 

+ ■■-, (18) 

where the higher order terms • • • include more compli- 
cated products of receptor voltages. More precisely, 
this is the leading term in a power series expansion, 
and at low SNR the leading term is guaranteed to 
dominate. The detailed structure of the kernels K„ m 
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depend on our assumptions about the statistical struc- 
ture of the visual world, but the general correlator 
form is independent of these assumptions. 

Intermediate SNR. As the signal-to-noise ratio in- 
creases, higher order terms in the expansion of Eq l|18|) 
become important and also the kernels K nm become 
modified so that the optimal estimator integrates over 
shorter times when the signals are more robust and 
when typical velocities are larger. 

High SNR. At high SNR, under a broad range of 
assumptions about the statistics of the visual world 
the optimal estimator crosses over to approximate 
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where A and A 1 are constants that depend on the de- 
tails of the image statistics and the last expression 
indicates schematically the limiting behavior at high 
contrast. In this limit the optimal estimator is just 
the ratio of temporal and spatial derivatives. At very 
high SNR there is no need to average over time to sup- 
press noise, so we show the estimator as being an in- 
stantaneous function of the receptor voltages and their 
derivatives; more generally at finite SNR the form of 
the estimator is the same but the receptor responses 
need to be smoothed over time. 

Perhaps the most interesting result of the theory is 
that both the correlator model and a ratio of deriva- 
tives model emerge as opposite limiting cases of the 
general estimation problem. The ratio of derivatives 
is in some sense the naive solution to the problem of 
motion estimation, since if the movie on the retina 
has the form C(x,t) — F(x — vt), corresponding to 
pure rigid motion, then it is easy to see that the ve- 
locity v = -(dC/dt)/{dC/dx). This isn't the gen- 
eral solution because the combination of differentia- 
tion and division tends to amplify noise; thus the ratio 
of derivatives emerges only as the high SNR limit of 
the general problem. At the opposite extreme, the 
correlator model is maximally robust against noise al- 
though it does make well known systematic errors by 
confounding the true velocity with the contrast and 
spatial structure of the image; the theory shows that 
these errors — which have correlates in the responses of 
the motion sensitive neurons, as well as in behavioral 
experiments — may emerge not as limitations of neural 
hardware but rather as part of the optimal strategy 
for dealing with noise. 

Although the theory of optimal estimation gives 
a new perspective on the correlator model, one can 
hardly count the well established correlator-like be- 



havior of motion sensitive neurons as a success of the 
theory. The real test would be to see the crossover 
from correlator-like behavior to the ratio of deriva- 
tives. Notice from Eq. that if the overall SNR is 
large but the contrast is (instantaneously) small, the 
optimal estimate is again correlator-like because the 
contrast dependent term in the denominator can be 
neglected. Thus even under a statistically stationary 
set of conditions corresponding to high SNR, we should 
be able to see both the correlator, with its mutiplica- 
tive nonlinearity, and the divisive nonlincarity from 
the ratio of derivatives. Behaviors consistent with this 
prediction have been observed |34| , but we would like 
a more direct demonstration. 

If we consider stimuli of the simple form in Eq. I|16[) . 
then it is easy to see that the motion estimator in Eq. 
(121) [I can be written as 
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where again B is a constant. More generally, if the 
receptor signals are all smoothed in the time domain 
by a filter /(r), then by analogy with Eq's. we 
can define four relevant dimensions of the input movie, 
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Thus the optimal estimator again is a function of four 
relevant dimensions out of the high dimensional space 
of input signals, these dimensions are built by oper- 
ating on s(t) and c(t) with two filters where one is 
the time derivative of the other, and then the four di- 
mensions are combined nonlincarly. By analogy with 
Eq. i we can identify the "correlator variable" V CO rr 
constructed from these four variables, 

Vcorr = S X • S 4 - S 2 • S 3 , (28) 

and the full optimal estimator normalizes this correla- 
tor variable through a divisive nonlinearity, 

A Vcorr 
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where T> = s\-\- s\ approximates the mean square spa- 
tial derivative of the image. Note that the different 
dimensions enter in highly symmetric combinations. 

The goal of the experiments described below is in 
fact to test the predictions from Eq. 1)2 7|). but we 
can also view this as an example of the more gen- 
eral problem of searching for relevant low dimensional 
structures within the high dimensional space of inputs 
to which a neuron responds. Related combinations 
of multiplicative and divisive nonlinearities arise quite 
generally in models for the "normalization" of neural 
responses in visual cortex |l8l l35l | and in particular in 
the normalization models applied to the motion sensi- 
tive cells of primate area MT [l^. Recent work [3(j 
suggests that this sort of computation can be derived 
for motion estimation in primate vision from the same 
sorts of optimization arguments used previously for in- 
sect vision plf . 

Although our focus here has been on the problem 
of motion estimation in vision, it is important to note 
that the same kind of dimensionality reduction pro- 
vides a precise formulation of feature selectivity in 
other systems as well. The classic example of center- 
surround organization in retinal ganglion cells can 
be thought of as projecting the image (or movie) onto 
two dimensions corresponding to spatial filters with 
different radii. Truly linear center-surround behavior 
would then correspond to the cell responding to just 
a linear combination (difference) of these two dimen- 
sions, so that really only one combined projection is 
relevant, while more subtle forms of interaction be- 
tween center and surround (e.g., shunting inhibition) 
would still correspond to a projection onto two dimen- 
sions but the nonlinear operation with relates firing 
probability to location in this two dimensional space 
would be more complicated. Similarly, the oriented 
receptive fields in cortical simple cells arc described as 
having multiple subregions but if there are non- 
linear interactions among the subregions then effec- 
tively there is no single projection of the stimulus to 
which the cell responds; rather the small number of 
subregions defines a projection onto a low dimensional 
subspace of images and there may be nontrivial com- 
putations within this subspace. Indeed, computational 
models for the detection of contours and object bound- 
aries have precisely this sort of nonlinear interaction 
among linear receptive subfields [37| ■ 

More generally, while filtering and feature selectiv- 
ity sometimes are used as synonyms, actually detecting 
a feature requires a logical operation, and in fact in- 
teresting features may be defined by conjunctions of 
logical operations. These more sophisticated notions 
of feature selectivity are not summarized by a single 
filter or receptive field. These computations do fit, 
however, within the framework suggested here, as non- 
linear (perhaps even hard threshold) operations on a 



low dimensional projection of the stimulus. 



III. SEARCHING FOR LOW DIMENSIONAL 
STRUCTURES 

We have argued that models of motion estimation, 
and perhaps other examples of neural fetaure selec- 
tivity, belong to a class of models in which neurons 
are sensitive only to a low dimensional projection of 
their high dimensional input signals. One approach 
would be to find the best model in this class to de- 
scribe particular neurons. It would, however, be more 
compelling if we could provide direct evidence for ap- 
plicability of the class of models before finding the best 
model within the class. The essential hypothesis of Eq. 
(If 2fl is that neurons are sensitive to a subspace of in- 
puts with dimensionality K much smaller than the full 
stimulus dimensionality D. This suggests a series of 
questions: 

1. Can we make a direct measurement of K, the 
number of relevant dimensions? 

2. Can we find the set of vectors {v n } that span 
this relevant subspace? 

3. Can we map the nonlinearity G(-) that the neu- 
ron implements within this space? 

In the simple case where there is only one relevant di- 
mension, the idea of reverse or triggered correlation 
[T2I l38| allows us to find this one special direction in 
stimulus space provided that we choose our ensemble 
of stimuli correctly. If we want to test a model in 
which there are multiple stimulus dimensions we need 
to compute objects that have a chance of defining more 
than one relevant vector. The basic suggestion comes 
from early work on the fly's motion sensitive visual 
neurons |39j . Instead of computing the average stimu- 
lus that precedes a spike, we can characterize the fluc- 
tuations around the average by their covariance ma- 
trix. Along most directions in stimulus space, this 
covariance matrix has a structure determined only by 
correlations in the stimulus. There are a small num- 
ber of directions, however, along which the stimuli that 
trigger spikes have a different variance than expected a 
priori. The fact that the number of directions with dif- 
ferent variances is small provides direct evidence that 
the cell is sensitive only to a small number of projec- 
tions. Further, identifying the directions along which 
the variance is different provides us with a coordinate 
system that spans the set of relevant projections. The 
following arguments, leading to Eq. I|44|l . formalize 
this intuition, answering the first two questions above. 
Then we turn to an analysis of the nonlinearities in G, 
leading to Eq. JSUJl . 



G 



It is useful to think about the spike train as a sum 
of unit impulses, 



(30) 



where the t\ are the spike times. Then the quantities 
of interest are correlation functions between p{t) and 
the stimulus vector s t ; recall that this vector can rep- 
resent both the spatial and temporal variations in the 
stimulus movie. As an example, the average stimulus 
preceding a spike is 
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where f is the mean spike rate and (• • •) denotes an 
average over a very long experiment. If we repeat the 
same stimulus for many trials and average the resulting 
spike trains, then we will obtain the probability per 
unit time r(t) that the cell spikes, where t is measured 
by a clock synchronized with the repeating stimulus as 
in the usual poststimulus time histogram. Thus 



(p(*))trials = r(t). 
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The spike rate r(t) is an average of the spike train p(t) 
over all the noise in the neural response, so that when 
we need to compute averages over a long experiment, 
we imagine doing this formally by first averaging over 
the noise with the stimulus held fixed, and then aver- 
aging over the distribution of signals; for example, 



(p(t)s t ) = (r(t)s t ) s , 



(33) 



where (• ■ -} s denotes an average over the distribution 
of signals presented in the experiment. 

To find multiple relevant directions we consider the 
matrix of second moments that characterizes the stim- 
uli leading to a spike [3j|. If the components of the 
stimulus vector s t are written as s t (i), with the index 
i = 1, 2, • • • , D running over the full dimensionality of 
the stimulus space, then the second moments of stimuli 
preceeding a spike are 

C spike (i,j) ee (s tspike ( i )s tspike (j)} (34) 

= ±{p(t>t(i)st(j))- (35) 
r 

From the arguments above this can be rewritten as 

C spiks {i,j) = ~{r(t)at(i)at(j)). (36) 

It is crucial that C sp ik (*,i) is something we can es- 
timate directly from data, looking back at the stim- 
uli that lead to a spike and computing the matrix of 
their second moments according to the definition in 
Eq. 134(1 . On the other hand, Eq. I(36|) gives us a 



way of relating these computations from the data to 
underlying models of how the spike rate r(t) depends 
on the stimulus. 

In general it is hard to go further than Eq. l|3tj[) an- 
alytically. More precisely, with stimuli chosen from an 
arbitrary distribution the relation between C sp ik e and 
some underlying model of the response can be arbitrar- 
ily complicated |4Cj . We can make progress, however, 
if we are willing to restrict our attention to stimuli 
that are drawn from a Gaussian distribution as in re- 
verse correlation analyses. It is important to realize 
that this restriction, while significant, does not specify 
a uniquely "random" stimulus. Gaussian does not im- 
ply white; we can construct an arbitrary correlation 
function for our stimuli, including correlation func- 
tions modelled after natural signals Further, we 
can construct stimuli which are nonlinear functions of 
underlying "hidden" Gaussian variables; these stimuli 
can have a complex and even naturalistic structure — 
see, for example, Ref |2jj — and such hidden variable 
methods may be useful as a bridge to more general 
application of the dimensionality reduction idea. 

If the distribution of signals is Gaussian, then av- 
erages such as Eq. (I36f) are straightforward to com- 
pute. The key step is the following identity: If x = 
x\,X2, ■ ■ ■ ,xd is a vector drawn from a multidimen- 
sional Gaussian distribution with zero mean, and /(x) 
is a diffcrentiablc function of this vector, then 



(x i /(x))=^C ij < 
j=i 



dm 



(37) 



where Cjj = (xiXj) is the covariance matrix of x. This 
can be applied twice: 



k=l 
D 

= ^2 Cm 



dh/(x)] 
dx k 



<5 jm (/(x)> + x 



D 



dx m 



(38) 



Cy(/(X))+ J2 C ^ C ^ 



d 2 m 



(39) 



We can use this in evaluating C sp ike from Eq (|36|l by 
identifying the vector x with the stimulus s t and the 
spike rate r(t) with the function /(x). The result is 



C S pike (»,j) = C piioI (i,j) + AC(i,j), 



(40) 



(41) 
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where we sum over the repeated indices k and /, and 
Cprior is the second moment of stimuli averaged 
over the whole experiment, 



Cprior(i,j) = (S t (i)s t (j)). 



(42) 



Further, if the rate has the 'low dimensional' form of 
Eq. H12f) . then the derivatives in the full stimulus space 
reduce to derivatives of the function G with respect to 
its K arguments: 



8 2 r(t) 



ds t {k)ds t (l) 



_ d 2 G(s!,s 2 ,- ■ ■ s K ) 
ds a dsp 



(43) 



where as with the stimulus vector s t we use v a (i) 
to denote the components of the projection vectors 
v Q ; again the index i runs over the full dimension- 
ality of the stimulus, i — 1,2, ■■■,D while the in- 
dex a runs over the number of relevant dimensions, 
a = 1, 2, • • • , K, and we sum over repeated indices a 
and (3. 

Putting these results together, we find an expression 
for the difference AC between the second moments 
of stimuli that lead to a spike and stimuli chosen at 
random: 



AC(i,i) 



A{a,0) = 



[C ptioi (i,k)v a (k)]A(a,j3) 

X [v^(OCprior(i,i)] , 

/ d 2 G(s 1 ,s 2 ,- ■ ■ s K ) \ 
\ ds a dsR /' 



(44) 
(45) 



and we sum over all repeated indices a, (3, k and I in 
Eq. (|44|l . There are several important points which 
follow from these expressions. 

First , Eq. P^l shows that A C (i , j ) , which is a D x D 
matrix, is determined by the K x K matrix A(a, j3) 
formed from the second derivatives of the function G. 
This means that AC(i,j) can have only K nonzero 
eigenvalues, where K is the number of relevant stimu- 
lus dimensions. Thus we can test directly the hypothe- 
sis that the number of relevant dimensions is small just 
by looking at the eigenvalues of AC. Further, this test 
is independent of assumptions about the nature of the 
nonlinearities represented by the function G. 

Second, the eigenvectors of AC associated with the 
nonzero eigenvalues are linear combinations of the vec- 
tors v Q , blurred by the correlations in the stimulus 
itself. More precisely, if we look at the set of nontriv- 
ial eigenvectors u Q , with a — 1,2, ■■■,K, and undo 
the effects of stimulus correlations to form the vectors 
v' a = [Cp r i or ] _1 -u Q , then we will find that these vectors 
span the same space as the vectors v a which define the 
relevant subspace of stimuli. 

Third, we note that the eigenvalue analysis of AC 
is not a principal components analysis of the stimu- 
lus probability distribution. In particular, unless the 



function G were of a very special form, the distribu- 
tion of stimuli that lead to a spike will be strongly 
non-Gaussian, and so a principal components analy- 
sis of this distribution will not capture its structure. 
Further, directions in stimulus space that have small 
variance can nonetheless make large contributions to 
AC. Note also that the eigenvalues of AC can be 
both positive or negative, while of course the spectrum 
of a covariance matrix (associated with the principal 
components of the underlying distribution) always is 
positive. 

Finally, the eigenvectors (or their deblurred ver- 
sions) that emerge from this analysis are useful only 
because they define a set of dimensions spanning the 
space of relevant stimulus features. Once we are in this 
restricted space, we are free to choose any set of coordi- 
nates. In this sense, the notion of finding "the" linear 
filters or receptive fields that characterize the cell be- 
comes meaningless once we leave behind a model in 
which only one stimulus dimension is relevant. The 
only truly complete characterization is in terms of the 
full nonlinear input/output relation within the rele- 
vant subspace. 

Once we have identified a subspace of stimuli 
Si, S2, • • • , sk, we actually can map the nonlinear func- 
tion G directly provided that K is not too large. We 
recall that the spike rate r(t) is the probability per 
unit time that a spike will occur at time t, given the 
stimulus s t leading up to that time. Formally, 



r{t) = P[spike@i|s t ]. 



(46) 



From Eq. (|12fl . the rate depends only on K projections 
of the stimulus, and so 



r(t) = P[spike@£|si,s 2 , • • • ,Sff]- 



(47) 



But the probability of a spike given the stimulus can 
be rewritten using Bayes' rule: 



P[spike@f|si, s 2 , 

x P[si,s 2 



,s K 



P [spike @i\ 
P[si,s 2 , • • • , s K ] 
,s A '|spike@£]. (48) 

In the same way that the function P [spike @ t\s t ] 
gives the time dependent spike rate r(t), the number 
P[spike@£] is just the average spike rate f. Thus the 
nonlinear computation within the if -dimensional rele- 
vant subspace that determines the neural response can 
be found from the ratio of probability distributions in 
this subspace, 



r(t) 



rG(si, s 2 , ■ 
- P[si,s 2 , 



•,sk) 

■ ■ , sx|spike@£] 



P[si,s 2 , •■■,%] 



(49) 
(50) 



Now the full distribution P[si, s 2 , ■ ■ ■ , sk] is known, 
since this defines the conditions of the experiment; fur- 
ther, we have considered situations in which this distri- 
bution is Gaussian and hence is defined completely by 
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angular position (omm) 

FIG. 2: A two second segment of the stimulus and cor- 
responding neural response. The stimulus movie consists 
of vertical stripes. Here each horizontal slice indicates the 
pattern of these stripes at one instant of time. Plus signs 
at the right indicate spike times from HI. Brief periods 
with coherent motion to the left are correlated with the 
spikes, while clear motions to the right inhibit the neuron. 
The challenge of the present analysis is to make more pre- 
cise this connection between features of the movie and the 
probability of spiking. 



a K x K covariance matrix. The probabiliity distribu- 
tion of stimuli given a spike, the response-conditional 
ensemble pffll ]. can be estimated by sampling: each 
time we see a spike, we can look back at the full stim- 
ulus St and form the K projections s\, s%, ■ ■ ■ , sk] this 
set of projections at one spike time provides one sample 
drawn from the distribution P[si, S2, ■ ■ ■ , sk\ spike @ t], 
and from many such samples we can estimate the un- 
derlying distribution. This Bayesian strategy for map- 
ping the nonlinear input/ouput relation provides a 
large dynamic range proportional to the total number 
of spikes observed in the experiment — see, for exam- 
ple, Ref. [11. 

We emphasize that the procedure described above 
rests not on a family of parameterized models which 
we fit to the data, but rather on the idea that if the 
dimensionality of the relevant subspace is sufficiently 
small then we don't really need a model. In practice 
we have to assume only that the relevant functions are 
reasonably smooth, and then the combination of eigen- 
value analysis and Bayesian sampling provides explicit 
answers to the three questions raised at the beginning 
of this section. 



IV. AN EXPERIMENT IN HI 

We apply these ideas to an experiment on the fly's 
motion sensitive HI neuron, where we would like to 
dissect the different features of motion computation 
discussed in Section ^ A segment of the visual stim- 
ulus and Hi's response is shown in Fig. [21 While 
the most elementary model of motion computations 
involves temporal comparisons between two pixels, HI 
is a wide field neuron and so is best stimulated by spa- 
tially extended stimuli. To retain the simplicity of the 
two-pixel limit in an extended stimulus we consider 
here a stimulus which has just one spatial frequency, 
as in Eq. (|16() . For technical details of stimulus gen- 
eration and neural recording see Appendix lAl 

To make use of the results derived above, we choose 
s(t) and c{t) to be Gaussian stochastic processes, as 
seen in Fig, El with correlation times of 50 msec. Sim- 
ilar experiments with correlation times from 10-100 
msec lead to essentially the same results described 
here, although there is adaptation to the correlation 
time, as expected from earlier work [25ll34j . The prob- 
lem we would like to solve is to describe the relation 
between the stimulus movie I(x, t) and the spike ar- 
rival times (cf. Fig. 

Simple computation of the spike triggered average 
movie produces no statistically significant results: the 
cell is sensitive to motion, and invariant to the overall 
contrast of the movie, so that the stimulus generated 
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FIG. 3: Statistics of the visual stimulus. Probability distri- 
bution of s(t) and c(t) compared with a Gaussian. Because 
these signals represent image contrast, there is inevitably 
some clipping at large amplitude (light intensity cannot 
be negative), but this affects only ~ 1% of the signals. 
At right, the autocorrelation of the signals, (s(t)s(t + r)) 
(circles) and {c(t)c(t + r)) (squares) are almost perfect ex- 
ponential decays exp( — \t\/t c ), t c — 50msec. 
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FIG. 4: Distribution of interspike intervals. Solid line 
shows the observed distribution collected in 2 msec bins. 
Dashed line is an exponential fit to the tail of the distribu- 
tion. Exponential decay of the interval distribution means 
that successive spikes are occurring independently, and the 
data indicate that one must wait ~ 40 msec for this inde- 
pendence to be established. Further analysis is done only 
with these "isolated spikes." 

by the transformation (s, c) — > (— s, — c) will produce 
indistinguishable responses. This being said, we want 
to proceed with the covariance matrix analysis. 

From previous work we know that different patterns 
of spikes in time can stand for very different motion 
trajectories [3^. From the present point of view, this 
connection to spike patterns is not a question about 
the nature of the motion computation but rather about 
how the output of this computation is represented in 
the spike train. To simplify the discussion, we will 
focus on spikes that occur in relative isolation from 
previous spikes. Specifically, when we look at the in- 
terspike interval distribution in Fig. 0] we see that 
for intervals longer than ~ 40 msec the distribution 
has the form of a decaying exponential. This is what 
we expect if after such long intervals spikes are gen- 
erated independently without memory or correlation 
to previous spikes. More colloquially, spikes in this 
exponential regime are being generated independently 
in response to the stimulus, and not in relation to pre- 
vious spikes. All further analysis is done using these 
isolated spikes; for a related discussion in the context 
of model neurons see Ref . [42] • 

Qualitative examination of the change in stimulus 
covariance in the neighborhhod of an isolated spike, 
AC, reveals that it is a very smooth matrix, consistent 



with the idea that it is composed out of a small number 
of significant eigenvectors. To quantify these observa- 
tions, and proceed along the analysis program outlined 
in the previous section, we diagonalize AC to give the 
eigenvalues shown in Fig. [SJ In trying to plot the re- 
sults there is a natural question about units. Because 
the stimuli themselves arc not white, different stimulus 
components have different variances. The eigenvalue 
analysis of AC provides a coordinate system on the 
stimulus space, and the eigenvalues themselves mea- 
sure the change in stimulus variance along each coordi- 
nate when we trigger on a spike. Small changes in vari- 
ance along directions with large total variance presum- 
ably are not very meaningful, while along a direction 
with small variance even a small change could mean 
that the spike points precisely to a particular value 
of that stimulus component. This suggests measuring 
the eigenvalues of AC in units of the stimulus variance 
along each eigendirection, and this is what we do in 
Fig. [S] This normalization has the added value (not 
relevant here) that one can describe the stimulus in 
terms of components with different physical units and 
still make meaningful comparisons among the different 
eigenvalues. FigureJSJshows clearly that four directions 
in stimulus space stand out relative to a background of 
196 other dimensions. The discussion in Section HTl of 
models for motion estimation certainly prepares us to 
think about four special directions, but before looking 
at their structure we should answer questions about 
their statistical significance. 

In practice we form the matrix AC from a finite 
amount of data; even if spikes and stimuli were com- 
pletely uncorrelated, this finite sampling gives rise to 
some structure in AC and to a spectrum of eigenvalues 
which broadens around zero. One way to check the sig- 
nificance of eigenvalues is to examine the dependence 
of the whole spectrum on the number of samples. Out 
of the ~ 8000 isolated spikes which we have collected 
in this experiment, we show in left panel of Fig. 
what happens if we choose 10%, 20%, ... , 90% at 
random to use as the basis for constructing AC. The 
basic structure of four modes separated from the back- 
ground is clear once we have included roughly half the 
data, and the background seems (generally) to narrow 
as we include more data. 

A different approach to statistical significance is to 
generate a set of random data that have comparable 
statistical properties to the real data, breaking only 
the correlations between stimuli and spikes. If we shift 
the all the spikes forward in time by several seconds rel- 
ative to the stimulus, then since the correlations in the 
stimulus itself are short-lived, there will be no residual 
correlation between stimulus and spikes, but all the in- 
ternal correlations of these signals are untouched. If 
we choose the shift times at random, with a minimum 
value, then we can generate many examples of uncorre- 
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lated stimuli and spikes, and find the eigenvalue spec- 
tra of AC in each example. Taken together a large 
number of these examples gives us the distribution of 
eigenvalues that we expect to arise from noise alone, 
and this distribution is shown in cumulative form in 
the right panel of Fig. A crucial point — expected 
from the analytic analysis of eigenvalues in simpler 
cases of random matrices 01 — is that the distribution 
of eigenvalues in the pure noise case has a sharp edge 
rather than a long tail, so that the band of eigenvalues 
in a single data set will similarly have a fairly definite 
endpoint rather than a long tail of 'stragglers' which 
could be confused with significant dimensions. While 
larger data sets might reveal more significant dimen- 
sions, Fig. indicates that the present data set points 
to four and only four significant stimulus dimensions 
out of a total of 200. 

In Fig. |3 we show the eigenvectors of AC associated 
with the four significant nonzero eigenvalues. We can 
think of these eigenvectors as filters in the time do- 
main which are applied to the two spatial components 
of the movie s(t) and c(t); the four eigenvectors thus 
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FIG. 5: Eigenvalues of AC. Stimuli are represented as seg- 
ments of s(t) and c(t) in windows of ±200 msec surrounding 
isolated spikes, sampled at 4 msec resolution; AC thus is a 
200 x 200matrix. As explained in the text, the eigenvalue 
measures the spike-triggered change in stimulus variance 
along a particular direction in stimulus space, while the 
eigenvector specifies this direction. Since the stimulus it- 
self has correlations, different directions have different vari- 
ances a priori, and we express the change in variance as a 
fraction of the total variance along the corresponding di- 
rection. There are four dimensions which stand out clearly 
from the background. 
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FIG. 6: Testing the significance of the eigenvalue distribu- 
tions. At left we show the evolution of the eigenvalue spec- 
trum as we analyze larger data sets. The four dimensions 
which stand out from the background in the full data set 
also have stable eigenvalues as a function of data set size, 
in contrast to the background of eigenvalues which come 
from a distribution which narrows as more data is included. 
At right we show the cumulative probability distribution 
of eigenvalues from surrogate data in which the correla- 
tons between stimulus and spike train have been broken 
by random time shifts, as explained in the text. Eigenval- 
ues with absolute value larger than 0.1 arise roughly 1% 
of the time, but there is a very steep edge to the distribu- 
tion such that absolute values larger than 0.13 occur only 
0.01% of the time. The sharp edge in the random data 
is essential in identifying eigenvalues which stand out from 
the background, and this edge is inherited from the simpler 
problem of eigenvalues in truly random matrices }43j. 



determine eight filters. We see that among these eight 
filters there are some similarities. For example, the fil- 
ter applied to c(t) in eigenvector (a) looks very similar 
to that which is applied to s(t) in eigenvector (b), the 
filter applied to s(t) in eigenvector (c) is close to being 
the negative of that which is applied to c(t) in eigen- 
vector (d), and so on. Some of these relations are a 
consequence of the approximate invariance of Hi's re- 
sponse to static translations of the visual inputs, and 
this is related to the fact that the significant eigen- 
values form two nearly degenerate pairs. In fact the 
similarity of the different filters is even greater than 
required by translation invariance, as indicated by the 
singular value decompostion shown in the lower panels 
of Fig. the eight temporal filters which emerge from 
the eigenvalue analysis are constructed largely from 
only two underlying filters which account for 80% of 
the variance among the filter waveforms. These two fil- 
ters are extended in time because they still contain [as 
expected from Eq. (|44ll ] a "blurring" due to intrinsic 
correlations in the stimulus ensemble. When we de- 
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convolve these correlations we inevitably get a noisy 
result, but it is clear that the two filters form a pair, 
one which smooths the input signal over a ~ 40 msec 
window, and one which smooths the time derivative of 
the input signal over the same window. 

The results thus far already provide confirmation for 
important predictions of the models discussed in Sec- 
tion[n] With stimuli of the form used here, these mod- 
els predict that the motion estimator is constructed 
from four relevant stimulus dimensions, that these di- 
mensions in turn are built from just two distinct tem- 
poral filters applied to two different spatial compo- 
nents of the visual input, and that one filter is the 
time derivative of the other [cf. Eq's. l|23H27|l ]. All 
three of these features are seen in Figs. JSJ and Q. 

To proceed further we sample the probability dis- 
tributions along the different stimulus dimensions, as 
explained in the discussion surrounding Eq. (|S*U)l . Al- 
though the reduction from 200 to 4 stimulus dimen- 
sions is useful, it still is difficult to examine probabil- 
ity distributions in a four dimensional space. We pro- 
ceed in steps, guided by the intuition from the models 
in Section [H] We begin by looking at a two dimen- 
sional projection. We recall that the optimal estima- 
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FIG. 7: Top panels: Eigenvectors associated with the four 
significant eigenvalues. Solid lines show components along 
the s stimulus directions, dashed lines along the c direc- 
tions. Bottom panels: Analysis of filters, (e) Results of 
singular value decomposition demonstrate that most of the 
variation among the eight filters at left can be captured by 
two modes, which are shown in (f). Deconvolving the stim- 
ulus correlations from these vectors we find the results in 
(g), where we note that anti-causal pieces of the filters are 
now just noise, as expected if the deconvolution is success- 
ful. Closer examination shows that one of these filters is 
approxmiately the derivative of the other, and in (h) we 
impose this condition exactly and truncate the filters to 
the 100 msec window which seems most relevant. 
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FIG. 8: Spike probability in two stimulus dimensions. 
Color code represents log 10 [r(t)/(spikes/sec)], with r(t) de- 
termined from sampling the probability distributions in 
Eq. 1501 . and we normalize the projections of the stimulus 
along each dimension such that they have unit variance. 
Note that there is no absolute preference for the sign of 
the individual stimulus components; rather the spike prob- 
ability is higher when the two components have opposite 
signs and lower when they have the same sign. This is 
the pattern expected if the neuron in fact is sensitive to 
the product of the two components, as in the correlation 
computation of Eq. (12811 . 



tion strategy involves the correlation of spatial and 
temporal derivatives. When we do this [as in Eq's. 
(I27|l and l|28|) ] we find terms involving the product of 
the time derivative of s(t) with the current value of 
c(t), as well as the other combination is which s and 
c are exchanged. This suggests that we look at the 
response of HI in the plane determined by the differ- 
entiating filter applied to s and the smoothing filter 
applied to c, and this is shown in Fig. |S| We see the 
general structure expected if the system is sensitive to 
a product of the two dimensions: symmetry across the 
quadrants, and contours of equal response have an ap- 
proximately hyperbolic form. The same structure is 
seen in the other pair of dimensions, but with the 90° 
rotation expected from the theory. 

If we take the product structure of the correlator 
models — and the numerator of the optimal estimation 
theory prediction in Eq. I|27|l — seriously, then we can 
take the two dimensional projection of Fig. [SJand col- 
lapse it onto a single dimension by forming the prod- 
uct of the two stimulus variables. We can do the same 
thing in the other two stimulus dimensions, and then 
sum these two (nonlinear) stimulus variables to form 
the anti-symmetric combination V CO rr from Eq. I|28|l . 
The dependence of the spike rate on V cor r is shown in 
Fig. |5J the rate is modulated by roughly a factor of 
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one hundred in response to changes in this stimulus 
variable. We emphasize that this conclusion is derived 
not by "manually" changing the value of the correla- 
tor variable, but rather by extracting this nonlinear 
combination of stimulus variables from the continuous 
variations of a complex dynamic input. 

The correlator variable V CO rr = si • S3 — S2 ■ S4 from 
Eq. I|28|l is only one of many possible nonlinear com- 
binations of the four stimulus dimensions that emerge 
from the analysis of AC. It is predicted by theory to 
be a central part of the motion computation, but it 
also defines a quantity that is invariant to a time- 
independent spatial translation of the visual stimu- 
lus; thus the correlator variable automatically incor- 
porates the approximate translation invariance of Hi's 
rcponse. Another such invariant combination is 

V = sl+sl. (51) 

In the predictions of optimal estimation theory [cf. Eq. 
(|27|l ]. this combination arises because it is proportional 
to the mean square spatial derivative of the (tempo- 
rally filtered) image on the retina. In Fig. 1101 we show 
the firing rate r(t) as a function of the two variables 
Vcorr and T>. By exploring this (nonlinear) two dimen- 
sional space we expose a much larger dynamic range 
of firing rates than can be seen by projecting onto the 
correlator variable alone as in Fig. |U| We see in the 
right panel of Fig. ^| that with V fixed there is lit- 
tle evidence of saturation in the plot of spike rate vs. 
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FIG. 9: Neural response to the correlator variable V CO rr 
from Eq. 1281 . At left, probability distributions of the 
correlator variable a priori (dashed) and conditional on an 
isolated spike (solid). Note that since the correlator vari- 
able is a nonlinear combination of the four relevant stimu- 
lus dimensions, the prior distribution is not Gaussian but 
in fact almost precisely exponential in form. At right, the 
spike rate r(t) calculated from these distributions using Eq. 
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FIG. 10: Spike probability in two nonlinear dimensions. At 
left, the spike rate as a function of the correlator variable 
Vcorr (as in Fig. |UJ and the mean square spatial deriva- 
tive of the movie, as computed from our stimulus projec- 
tions as proportional to T> [Eq. 1511 1. Color scale indicates 
log 10 r(f), with r measured in spikes/sec. We see a hint 
that the contours of constant color form a fan, as expected 
if the neuron responds approximately to the ratio of the 
correlator variable and the mean square spatial derivative. 
At right, we make this clearer by sorting the stimuli into 
small, medium and large magnitudes of the spatial deriva- 
tive, effectively taking slices through the two dimensional 
plot at left and averaging over the distribution of stimuli. 
The gain of the response to the correlator variable clearly 
is modulated by the mean square spatial derivative. 



Vcorr, and the gain in the neural response to the cor- 
relator variable is modulated by the magntiude of T>. 

Another way to look at the results of Fig. ^]is to 
plot r(t) vs. Vcorr with separate points for the differ- 
ent values of V. Because the rate has a substantial 
dependence on T>, there is a large amount of scatter, 
as seen at the left panel of Fig. ^] On the other hand, 
the prediction of optimal estimation theory is that the 
response should be a function of normalized correlator 
variable V CO rr/(-B + V) [Eq If this prediction is 

correct, then if we plot r(t) vs this normalized quan- 
tity all of the scatter should collapse. The only un- 
certainty from optimal estimation theory is the value 
of the parameter B, which depends on detailed as- 
sumptions about the statistical structure of the visual 
world. We can choose the value of B which minimizes 
the scatter, however, and the results are shown in the 
right panel of Fig. ^] We see that by constructing the 
normalized correlator variable predicted from optimal 
estimation theory we have revealed a dynamic range of 
nearly 10 3 in the modulation of spike rate. Since the 
experiment involves only ~ 8 x 10 3 isolated spikes, it 
is difficult to imagine meaningful measurements over 
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a larger dynamic range. 



V. DISCUSSION 

The approach to analysis of neural responses that 
we have taken here grows out of the reverse corre- 
lation method. We recall that correlating the input 
and output of neurons, especially spiking neurons, can 
be viewed from two very different perspectives, as re- 
viewed in Ref. In one view, we imagine writing 
the firing rate of a neuron as a functional power series 
in the input signal, 



1 



i ij 



(52) 

where as in the discussion above s t (i) is the i th com- 
ponent of the stimulus history that leads up to time 
t, and Ki, K2, ■■■ are a set of "kernels" that form 
the coefficients of the series expansion. If we choose 
inputs St with Gaussian statistics, then by comput- 
ing the spike-triggered average stimulus we can re- 
cover K\ {%) , by computing the spike-triggered second 
moments we can recover K2(i,j), and so on; this is 
the Wiener method for analysis of a nonlinear system. 
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FIG. 11: Collapsing back to a single stimulus feature. At 
left we show the data of Fig. 1101 as a scatter plot of spike 
rate vs. the correlator variable. Each data point corre- 
sponds to a combination of the correlator variable and the 
mean square spatial derivative; the broad scatter arises 
from the fact that the spike rate depends significantly on 
both variables. At right, we construct a normalized version 
of the correlator variable corresponding to the form pre- 
dicted by optimal estimation theory, Eq. I|29|l : the one ar- 
bitrary parameter is chosen to minimize the scatter. Note 
that the same number of data points appear in both plots; 
at right many of the points lie on top of one another. 



Poggio and Reichardt |2jj emphasized that the correla- 
tor model of motion estimation defines a computation 
that is equivalent precisely to a functional series as in 
Eq (|B^|l but with only the K2 term contributing. Fur- 
ther, they showed that other visual computations, such 
as the separation of objects from background via rel- 
ative motion, can be cast in the same framework, but 
the minimal nonlinearities are of higher order (e.g., K± 
in the case of figure-ground separation). 

Marmarelis and McCann used the Wiener method to 
analyze the responses of fly motion sensitive neurons 
44] . Using a pair of independently modulated light 
spots they verified that K\ makes a negligible contri- 
bution to the response, and showed that K2 has the 
dynamical structure predicted from experiments with 
double flash stimuli. By construction the second order 
Wiener kernel describes the same quadratic nonlinear- 
ity that is present in the correlator model. Results 
on the structure of K2 in motion sensitive neurons, 
as with many other experiments, thus are consistent 
with the correlator model, but don't really constitute 
a test of that model. In particular, such an analysis 
cannnot exclude the presence of higher order contribu- 
tions, such as those described by Eq l|19H21fl . 

Despite its formal generality, the Wiener approach 
has the problem that it is restricted in practice to 
low order terms. If we can measure only the first few 
terms in Eq l|52[l we arc in effect hoping that neural re- 
sponses will be only weakly nonlinear functions of the 
sensory input, and this generally is not the case. Simi- 
larly, while Poggio and Reichardt were able to identify 
the minimum order of nonlinearity required for differ- 
ent visual computations, it is not clear why the brain 
should use just these minimal terms. Crucially there is 
an interpretation of reverse correlation that does not 
rest on these minimalist or weak nonlinearity assump- 
tions. 

In their early work on the auditory system, de Boer 
and Kuyper emphasized that if there is a single stage of 
linear filtering followed by an arbitrary instantaneous 
nonlinearity, then with Gaussian inputs the spike- 
triggered avera ge o r "reverse correlation" will uncover 
the linear filter 381. In our notation, if we can write 



r{t)=¥G{v 1 -s t ) 



(53) 



then the spike-triggered average stimulus allows us to 
recover a vector in stimulus space proportional to the 
filter or receptive field vi independent of assumptions 
about the form of the nonlinear function G, as long 
as symmetries of this function do not force the spike- 
triggered average to be zero. The hypothesis that neu- 
rons are sensitive to a single component of the stimulus 
clearly is very different from the hypothesis that the 
neuron responds linearly. Our approach generalizes 
this interpretation of reverse correlation to encompass 
models in which the neural response is driven by mul- 
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tiple stimulus components, but still many fewer than 
are required to describe the stimulus completely, as in 
Eq (|T2). 

The success of the receptive field concept as a tool 
for the qualitative description of neural responses has 
led to considerable interest in quantifying the precise 
form of the receptive fields and their analogs in dif- 
ferent systems. This focus on the linear component 
of the strongly nonlinear neural response leaves open 
several important questions. In the auditory system, 
for example, observation of spectrotemporal receptive 
fields with frequency sweep structures does not tell us 
whether the neuron simply sums the energy in differ- 
ent frequency bands with different delays, or if the 
neuron has a strong, genuine "feature detecting" non- 
linearity such that it responds only when power in one 
frequency band is followed by power in a neighboring 
band. Similarly, the different models of motion esti- 
mation discussed in Section^are distinguished not by 
dramatically different predictions for the spatiotem- 
poral filtering of incoming visual stimuli, but by the 
way in which these filtered components are combined 
nonlinearly. If we hope to explore nonlinear interac- 
tions among multiple stimulus components, it is cru- 
cial that there not be too many relevant components. 
The spike-triggered covariance method as developed 
here provides us with tools for counting the number of 
relevant stimulus dimensions, for identifying these di- 
mensions or features explicitly, and for exploring their 
interactions. 

As far as we know the first consideration of spike- 
triggered covariance matrices was by Bryant and Se- 
gundo in an analysis of the neural responses to in- 
jected currents |45(; in many ways this paper was far 
ahead of its time. Our own initial work on the spike- 
triggered (or more generally response-triggered) co- 
variance came from an interest in making models of 
the full distribution of stimuli conditional on a spike 
or combination of spikes, from which we could compute 
the information carried by these events. In that con- 
text the small number of nontrivial eigenvalues in AC 
meant that we could make estimates which were more 
robust against the problems of small sample size 39]. 
Roughly ten years later we realized that this struc- 
ture implies that the probability of generating a spike 
must depend on only a low dimensional projection of 
the stimulus, and that the analysis of spike-triggered 
covariance matrices thus provides a generalization of 
reverse correlation to multiple relevant dimensions, as 
presented here [46l| . 

The ideas of the covariance matrix analysis were 
stated and used in work on adaptation of the neural 
code in the fly motion sensitive neurons |26| , and in 
characterizing the computation done by the Hodgkin- 
Huxley model neuron [42] • Preliminary results suggest 
that the same approach via low-dimensional struc- 



tures may be useful for characterizing the feature selec- 
tivity of auditory neurons 47] . In the mammalian vi- 
sual system the covariance matrix methods have been 
used in both the retina and in the primary visual 
cortex [49l IBol IBH to characterize the neural response 
beyond the model of a single receptive field. Most re- 
cently this approach has been used to reveal the sen- 
sitivity of retinal ganglion cells to multiple dimensions 
of temporal modulation, and to demonstrate a strik- 
ing diversity in how these dimensions are combined 
nonlinearly to determine the neural response [5^ . 

In the particular case of motion estimation, the 
spike-triggered covariance method has made it pos- 
sible to test important predictions of optimal estima- 
tion theory. The optimal motion estimator exhibits 
a smooth crossover from correlator-like behavior at 
low signal to noise ratios to ratio of gradient behavior 
at high signal to noise ratio |2l|. In particular this 
means that the well known confounding of contrast 
and velocity in the correlator model should give way 
to a more "pure" velocity sensitivity as the signal to 
noise ratio is increased. In practice this means that the 
contrast dependence of responses in motion sensitive 
neurons should saturate at high contrast but this sat- 
urated response should retain its dependence on veloc- 
ity. Further, the form of the saturation should depend 
on the mean light intensity and on the statistics of 
the input movies. All of these features are observed in 
the responses of HI [34| , but none is a "smoking gun" 
for the optimal estimator. Specifically, the optimal 
estimator has two very different types of nonlinear- 
ity: the multiplicative nonlinearity of the correlator 
model, and the divisive nonlinearity of the gradient 
ratio. It is this nonlinear structure — and not, for ex- 
ample, a dramatic shift in frequency response or other 
quasi-linear filtering properties — that seems to be the 
central prediction of optimal estimation theory. The 
spike-triggered covariance matrix method has allowed 
us to demonstrate directly that both nonlinearities op- 
erate simultaneously in shaping the response of HI to 
complex, dynamic inputs — the multiplicative nonlin- 
earity is illustrated by Figs |S1 and |5J while the divisive 
nonlinearity is revealed in Figs 1101 and II II 

Although much remains to be done, the demonstra- 
tion that the nonlinearities in motion computation are 
of the form predicted from optimal estimation theory 
helps to close a circle of ideas which began with the 
observation that HI encodes near-optimal motion esti- 
mates, at least under some range of conditions 0,^3- 
If the nervous system is to achieve optimal perfor- 
mance (at motion estimation or any other task) then 
it must carry out certain very specific computations. 
Evidence for optimality thus opens a path for theories 
of neural computation based on mathematical analysis 
of the structure of the problem that the system has to 
solve rather than on assumptions about the internal 
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dynamics of the circuitry that implements the solu- 
tion. Except in the very simplest cases, testing these 
theories requires methods for analysis of the nonlinear 
structure in the neural response to complex stimuli. 

No matter one what one's theoretical prejudices, an- 
alyzing neural processing of complex, dynamic inputs 
requires simplifying hypotheses. Linearity or weak 
nonlinearity, central features of the classical Wiener 
system identification methods, are unlikely to be accu- 
rate or sufficient. The widely used concept of receptive 
field replaces linearity with a single stimulus template, 
and much effort has gone into developing methods for 
finding this single relevant direction in stimulus space. 
But already the earliest discussions of receptive fields 
made clear that there can be more than one relevant 
stimulus dimension, and that these features can inter- 
act nonlinearly. 

The methods developed here go systematically be- 
yond the "single template" view of receptive fields: we 
can count the number of relevant stimulus dimensions, 
identify these features explicitly, and in favorable cases 
map the full structure of their nonlinear interactions. 
Crucially, essential aspects of the results are clear even 
from the analysis of relatively small data sets (cf Fig 
0), so that one can make substantial progress with 
minutes rather than hours of data. Further, the possi- 
bility of visualizing directly the nonlinear interactions 
among different stimulus dimensions by sampling the 
relevant probability distributions allows us to go be- 
yond fitting models to the data; instead one can be 
surprised by unexpected features of the neuron's com- 
putation, as in recent work on the retina . 

The idea that neurons are sensitive to low dimen- 
sional subspaces within the high dimensional space of 
natural sensory signals is a hypothesis that needs to be 
tested more fully. If correct, this dimensionality reduc- 
tion can be sufficiently powerful to render tractable the 
otherwise daunting problem of characterizing the neu- 
ral processing and representation of complex inputs. 

Acknowledgements 

We thank GD Lewen and A Schweitzer for their help 
with the experiments. Discussions with N Brenner 



and N Tishby were crucial in the development of our 
ideas, and we thank also B Agiiera y Areas, AJ Doupe, 
AL Fairhall, R Harris, NC Rust, E Schneidman, K 
Sen, T Sharpee, JA White, and BD Wright for many 
discussions exploring the application of these meth- 
ods in other contexts. Early stages of this work were 
supported by the NEC Research Institute and were 
presented as part of the summer course on computa- 
tional neuroscience at the Marine Biological Labora- 
tory; we thank many students and course faculty who 
asked penetrating questions and had helpful sugges- 
tions. Completion of the work was supported in part 
by National Science Foundation Grant IIS-0423039, 
as part of the program for Collaborative Research in 
Computational Neuroscience. 



APPENDIX A: EXPERIMENTAL METHODS 

Spikes from HI are recorded with a conventional ex- 
tracellular tungsten microelectrode (FHC Inc., 3 Mil), 
using a silver wire as the reference electrode. HI is 
identified by the combination of its projection across 
the midline of the brain and its selectivity to inward 
motion 0, |U 13 • Spike arrival times are digitized to 
10 /its precision and stored for further analysis, using a 
CED 1401plus real time computer. Stimulus patterns 
are computed using a Digital Signal Processor board 
(Ariel) based on a Motorola 56001 processor, and con- 
sist of frames of nominally 200 vertical lines, written 
at a frame rate of 500 Hz. Thus the patterns are essen- 
tially 1-dimensional, but extended in the vertical direc- 
tion. They are displayed on a Tektronix 608 monitor 
(phosphor P31), at a radiance of I = 165mW/m 2 • sr. 
Taking spectral and optical characteristics of the pho- 
toreceptor lens-wave guide into account, this light in- 
tensity corresponds to a flux of ~ 4 x 10 4 effectively 
transduced photons/s in each retinal photoreceptor. 
Frames are generated in synchrony with the spike tim- 
ing clock by forcing the DSP to generate frames trig- 
gered by a 2 ms timing pulse from the CED. Angular 
dimensions of the display are calibrated using the mo- 
tion reversal response of the HI neuron |55| . 
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